109 research outputs found
On the Use of Evaluation Measures for Defect Prediction Studies
Software defect prediction research has adopted various evaluation measures to assess the performance of prediction models. In this paper, we further stress the importance of choosing appropriate measures in order to correctly assess the strengths and weaknesses of a given defect prediction model, especially given that most defect prediction tasks suffer from data imbalance.
Investigating 111 previous studies published between 2010 and 2020, we found that over half either use only one evaluation measure, which alone cannot express all the characteristics of model performance in the presence of imbalanced data, or a set of binary measures which are prone to bias when used to assess models, especially models trained on imbalanced data.
We also unveil the magnitude of the impact of assessing popular defect prediction models with several evaluation measures based, for the first time, on both statistical significance tests and effect size analyses. Our results reveal that the evaluation measures produce a different ranking of the classification models in 82% and 85% of the cases studied according to the Wilcoxon statistical significance test and the Â12 effect size, respectively. Further, we observe a very high rank disruption (between 64% and 92% on average) for each of the measures investigated. This signifies that, in the majority of cases, a prediction technique that would be believed to be better than others when using a given evaluation measure becomes worse when using a different one.
We conclude by providing some recommendations for the selection of appropriate evaluation measures based on factors specific to the problem at hand, such as the class distribution of the training data and the way in which the model has been built and will be used. Moreover, we recommend including in the set of evaluation measures at least one able to capture the full picture of the confusion matrix, such as MCC. This will enable researchers to assess whether proposals made in previous work can be applied for purposes different from the ones they were originally intended for. Besides, we recommend reporting, whenever possible, the raw confusion matrix to allow other researchers to compute any measure of interest, thereby making it feasible to draw meaningful observations across different studies.
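To illustrate why a measure capturing the full confusion matrix matters, the following minimal sketch computes MCC from raw counts and contrasts it with accuracy on a made-up imbalanced test set (all numbers are hypothetical, not taken from the study):

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from raw confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Hypothetical imbalanced test set: 50 defective vs. 950 clean modules.
tp, fp, fn, tn = 30, 100, 20, 850
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(round(accuracy, 3))           # high accuracy despite missing 20/50 defects
print(round(mcc(tp, fp, fn, tn), 3))  # MCC reveals the weaker performance
```

On such skewed data, accuracy looks strong (0.88) while MCC stays modest, which is precisely the kind of discrepancy the paper warns about.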
On the Relationship Between Story Point and Development Effort in Agile Open-Source Software
Background: Previous work has provided some initial evidence that Story Points (SP) estimated by human experts may not accurately reflect the effort needed to realise Agile software projects.
Aims: In this paper, we aim to shed further light on the relationship between SP and Agile software development effort, to understand the extent to which human-estimated SP is a good indicator of user story development effort, expressed in terms of the time needed to realise it.
Method: To this end, we carry out a thorough empirical study involving a total of 37,440 unique user stories from 37 different open-source projects publicly available in the TAWOS dataset. For these user stories, we investigate the correlation between the issue development time (or its approximation when the actual time is not available) and the SP estimated by human experts, using three widely used correlation statistics (i.e., Pearson, Kendall and Spearman). Furthermore, we investigate the SP estimations made by the human experts in order to assess the extent to which they are consistent throughout the project, i.e., we assess whether the development time of the issues is proportionate to the SP assigned to them.
Results: The average results across the three correlation measures reveal that the correlation between the human-estimated SP and the approximated development time is strong for only 7% of the projects investigated, and medium (58%) or low (35%) for the remaining ones. Similar results are obtained when the actual development time is considered. Our empirical study also reveals that the estimations made are often not consistent throughout the project, and the human estimators tend to misestimate in 78% of the cases.
Conclusions: Our empirical results suggest that SP might not be an accurate indicator of open-source Agile software development effort expressed in terms of development time. The impact of its use as an indicator of effort should be explored in future work, for example as a cost-driver in automated effort estimation models or as the prediction target.
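The three correlation statistics used in the study can be sketched in a few lines of pure Python (the SP and development-hour values below are invented for illustration; they are not from the TAWOS dataset):

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation: linear association between two samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def rankdata(x):
    """Ranks (1-based), with ties given their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(rankdata(x), rankdata(y))

def kendall(x, y):
    """Kendall's tau (tau-a): concordant vs. discordant pairs."""
    c = d = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            c += 1
        elif s < 0:
            d += 1
    return (c - d) / (len(x) * (len(x) - 1) / 2)

# Made-up example: SP estimates vs. development hours for six issues.
sp    = [1, 2, 3, 5, 8, 13]
hours = [4, 6, 5, 12, 9, 30]
print(round(pearson(sp, hours), 3), round(spearman(sp, hours), 3),
      round(kendall(sp, hours), 3))
```

A correlation near 1 across all three statistics would indicate that SP tracks development time well; the study's point is that this rarely holds in practice.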
Search-based approaches for software development effort estimation
2011 - 2012
Effort estimation is a critical activity for planning and monitoring software project development and for delivering the product on time and within budget. Significant over- or under-estimates expose a software project to several risks. As a matter of fact, under-estimates could lead to the addition of manpower to a late software project, making the project later (Brooks's Law), or to the cancellation of activities, such as documentation and testing, negatively impacting software quality and maintainability. Thus, the competitiveness of a software company heavily depends on the ability of its project managers to accurately predict in advance the effort required to develop a software system. However, several challenges exist in making accurate estimates, e.g., the estimation is needed early in the software lifecycle, when little information about the project is available, and several factors can impact project effort, with these factors usually being specific to different production contexts.
Several techniques have been proposed in the literature to support project managers in estimating software project development effort.
In recent years, the use of Search-Based (SB) approaches has been suggested as an effort estimation technique. These approaches include a variety of meta-heuristics, such as local search techniques (e.g., Hill Climbing, Tabu Search, Simulated Annealing) and Evolutionary Algorithms (e.g., Genetic Algorithms, Genetic Programming).
The idea underlying the use of such techniques is the reformulation of software engineering problems as search or optimization problems whose goal is to find the most appropriate solutions conforming to some adequacy criteria (i.e., problem goals). In particular, the use of SB approaches in the context of effort estimation is twofold: they can be exploited to build effort estimation models or to enhance existing effort estimation techniques. The uses of SB approaches for effort estimation reported in the literature have provided promising results that encourage further investigation. However, they can be considered preliminary studies: the capabilities of these approaches were not fully exploited, and the empirical analyses employed did not consider the more recent recommendations on how to carry out this kind of empirical assessment in the effort estimation and SBSE contexts. The main aim of the PhD dissertation is to provide insight into the use of SB techniques for effort estimation, highlighting the strengths and weaknesses of these approaches for both of the uses mentioned above. [edited by Author]
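The reformulation of effort estimation as a search problem can be illustrated with the simplest local search technique mentioned above, Hill Climbing. This sketch tunes a single coefficient of a made-up one-parameter model, effort = a · size, to minimise mean absolute error on invented training data (it is an illustration of the idea, not any specific model from the dissertation):

```python
import random

def mae(a, data):
    """Mean absolute error of the model effort = a * size on (size, effort) pairs."""
    return sum(abs(a * size - effort) for size, effort in data) / len(data)

def hill_climb(data, a=1.0, step=0.1, iters=500, seed=42):
    """Hill Climbing over one coefficient: keep a random neighbour only if it improves."""
    rng = random.Random(seed)
    best = mae(a, data)
    for _ in range(iters):
        cand = a + rng.choice((-step, step))  # random neighbour
        score = mae(cand, data)
        if score < best:                      # accept only improvements
            a, best = cand, score
    return a, best

# Invented historical projects: (size in KLOC, effort in person-days).
history = [(10, 52), (20, 98), (30, 151)]
coeff, error = hill_climb(history)
print(round(coeff, 2), round(error, 3))
```

The same skeleton generalises to richer models and to the evolutionary algorithms cited above by replacing the single-neighbour move with population-based operators.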
Py2Cy: A Genetic Improvement Tool To Speed Up Python
Due to its ease of use and wide range of custom libraries, Python has quickly gained popularity and is used by a wide range of developers all over the world. While Python allows for fast writing of source code, the resulting programs are slow to execute when compared to programs written in other programming languages like C. One of the reasons for its slow execution time is the dynamic typing of variables. Cython is an extension to Python which can achieve execution speed-ups by compiler optimization. One possibility for improvement is the use of static typing, which can be added to Python scripts by developers. To alleviate the need for manual effort, we create Py2Cy, a Genetic Improvement tool for automatically converting Python scripts to statically typed Cython scripts. To show the feasibility of improving runtime with Py2Cy, we optimize a Python script for generating Fibonacci numbers. The results show that Py2Cy is able to speed up the execution time by up to a factor of 18.
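The kind of transformation involved can be sketched with a Fibonacci function: a dynamically typed Python version alongside a statically typed Cython counterpart of the sort such a conversion might produce (the Cython snippet is illustrative only; Py2Cy's actual output may differ):

```python
# Plain Python: variable types are resolved dynamically at runtime.
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# A statically typed Cython counterpart (illustrative; compiled with
# cython rather than run as plain Python):
#
# cpdef long fib(int n):
#     cdef long a = 0, b = 1
#     cdef int i
#     for i in range(n):
#         a, b = b, a + b
#     return a

print(fib(10))  # → 55
```

The `cdef`/`cpdef` declarations let the Cython compiler emit C-level integer arithmetic instead of operations on boxed Python objects, which is where the speed-up comes from.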
MEG: Multi-objective Ensemble Generation for Software Defect Prediction
Background: Defect prediction research aims at assisting software engineers in the early identification of software defects during the development process. A variety of automated approaches, ranging from traditional classification models to more sophisticated learning approaches, have been explored to this end. Among these, recent studies have proposed the use of ensemble prediction models (i.e., aggregations of multiple base classifiers) to build more robust defect prediction models.
Aims: In this paper, we introduce a novel approach based on multi-objective evolutionary search to automatically generate defect prediction ensembles. Our proposal is not only novel with respect to the more general area of evolutionary generation of ensembles, but it also advances the state-of-the-art in the use of ensembles in defect prediction.
Method: We assess the effectiveness of our approach, dubbed Multi-objective Ensemble Generation (MEG), by empirically benchmarking it against the most closely related proposals we found in the literature on defect prediction ensembles and on multi-objective evolutionary ensembles (which, to the best of our knowledge, had never previously been applied to tackle defect prediction).
Results: Our results show that MEG is able to generate ensembles which produce similar or more accurate predictions than those achieved by all the other approaches considered in 73% of the cases (with favourable large effect sizes in 80% of them).
Conclusions: MEG is not only able to generate ensembles that yield more accurate defect predictions with respect to the benchmarks considered, but it also does so automatically, thus relieving engineers from the burden of manual design and experimentation.
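The basic building block MEG searches over, an ensemble aggregating multiple base classifiers, can be sketched as a simple majority vote (this is only the aggregation idea, not MEG's evolutionary search; the base classifiers and metric thresholds below are made up):

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Aggregate base classifiers' predictions by majority vote."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Toy base classifiers flagging a module as defective (1) or clean (0)
# from hypothetical metrics (lines of code, churn, cyclomatic complexity).
clf_a = lambda m: 1 if m["loc"] > 500 else 0
clf_b = lambda m: 1 if m["churn"] > 10 else 0
clf_c = lambda m: 1 if m["complexity"] > 15 else 0

module = {"loc": 800, "churn": 4, "complexity": 22}
print(majority_vote([clf_a, clf_b, clf_c], module))  # → 1 (two of three vote defective)
```

A multi-objective search such as MEG's would then explore which base classifiers to include and how to combine them, trading off competing objectives rather than fixing the ensemble by hand.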
Agile Effort Estimation: Have We Solved the Problem Yet? Insights From A Replication Study
In the last decade, several studies have explored automated techniques to estimate the effort of agile software development. We perform a close replication and extension of a seminal work proposing the use of Deep Learning for Agile Effort Estimation (namely Deep-SE), which has set the state-of-the-art since. Specifically, we replicate three of the original research questions, aiming at investigating the effectiveness of Deep-SE for both within-project and cross-project effort estimation. We benchmark Deep-SE against three baselines (i.e., Random, Mean and Median effort estimators) and a previously proposed method to estimate agile software project development effort (dubbed TF/IDF-SVM), as done in the original study. To this end, we use the data from the original study and an additional dataset of 31,960 issues mined from TAWOS, as using more data allows us to strengthen the confidence in the results and to further mitigate external validity threats. The results of our replication show that Deep-SE outperforms the Median baseline estimator and TF/IDF-SVM in only very few cases with statistical significance (8/42 and 9/32 cases, respectively), thus not confirming previous findings on the efficacy of Deep-SE. The two additional RQs reveal that neither augmenting the training set nor pre-training Deep-SE leads to an improvement in its accuracy and convergence speed. These results suggest that using semantic similarity alone is not enough to differentiate user stories with respect to their story points; thus, future work has yet to explore and find new techniques and features to obtain accurate agile software development estimates.
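The Mean and Median baselines used in the benchmark are straightforward: each ignores the issue entirely and predicts a constant derived from the training SPs. A minimal sketch with made-up data (the SP values are invented, not from the study):

```python
import statistics

def mean_estimator(train_sps):
    """Always predicts the mean SP of the training issues."""
    mu = statistics.mean(train_sps)
    return lambda _issue: mu

def median_estimator(train_sps):
    """Always predicts the median SP of the training issues."""
    med = statistics.median(train_sps)
    return lambda _issue: med

def mae(estimator, issues, actual_sps):
    """Mean absolute error of an estimator over a test set."""
    errors = [abs(estimator(i) - sp) for i, sp in zip(issues, actual_sps)]
    return sum(errors) / len(errors)

train = [1, 2, 3, 5, 8]                     # hypothetical training SPs
test_issues, test_sps = ["a", "b", "c"], [2, 3, 5]
print(mae(median_estimator(train), test_issues, test_sps))           # → 1.0
print(round(mae(mean_estimator(train), test_issues, test_sps), 3))   # → 1.267
```

A learned model like Deep-SE is only worth its complexity if it beats such constant predictors with statistical significance, which is exactly the comparison this replication revisits.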
An Empirical Study on the Fairness of Pre-trained Word Embeddings
Pre-trained word embedding models are easily distributed and applied, as they relieve users of the effort of training models themselves. With widely distributed models, it is important to ensure that they do not exhibit undesired behaviour, such as biases against population groups. For this purpose, we carry out an empirical study evaluating the bias of 15 publicly available, pre-trained word embedding models based on three training algorithms (GloVe, word2vec, and fastText) with regard to four bias metrics (WEAT, SEMBIAS, DIRECT BIAS, and ECT). The choice of word embedding models and bias metrics is motivated by a literature survey of 37 publications which quantified bias in pre-trained word embeddings. Our results indicate that fastText is the least biased model (in 8 out of 12 cases) and that small vector lengths lead to a higher bias.
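The core quantity behind a metric like WEAT is a differential association score: how much closer a target word vector sits to one attribute set than to another, measured by cosine similarity. A minimal sketch with invented 2-d vectors (real embeddings have hundreds of dimensions, and the full WEAT statistic additionally normalises over two target sets):

```python
def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

def assoc(w, A, B):
    """WEAT-style association of word vector w with attribute sets A and B:
    mean cosine similarity to A minus mean cosine similarity to B."""
    return (sum(cos(w, a) for a in A) / len(A)
            - sum(cos(w, b) for b in B) / len(B))

# Made-up 2-d attribute vectors: A leans one direction, B the other.
A = [(1.0, 0.0), (0.9, 0.1)]
B = [(0.0, 1.0), (0.1, 0.9)]
target = (0.8, 0.2)  # hypothetical target word vector

print(assoc(target, A, B) > 0)  # positive → target associates more with A
```

A systematically non-zero score for socially meaningful target words (e.g., occupations) against demographic attribute sets is what the surveyed metrics quantify as bias.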
Investigating the Effectiveness of Clustering for Story Point Estimation
Automated techniques to estimate Story Points (SP) for user stories in agile software development came to the fore a decade ago. Yet, the state-of-the-art estimation techniques’ accuracy has room for improvement.
In this paper, we present a new approach for SP estimation based on analysing the textual features of software issues by employing latent Dirichlet allocation (LDA) and clustering. We first use LDA to represent issue reports in a new space of generated topics. We then use hierarchical clustering to agglomerate issues into clusters based on their topic similarities. Next, we build estimation models using the issues in each cluster. Finally, we find the cluster closest to a newly arriving issue and use the model from that cluster to estimate its SP.
Our approach is evaluated on a dataset of 26 open source projects with a total of 31,960 issues and compared against both baselines and state-of-the-art SP estimation techniques.
The results show that the estimation performance of our proposed approach is as good as the state-of-the-art. However, none of these approaches is statistically significantly better than more naive estimators in all cases, which does not justify their additional complexity. We therefore encourage future work to develop alternative strategies for story point estimation.
The experimental data and scripts we used in this work are publicly available to allow for replication and extension.
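The prediction step described above can be sketched as follows: represent the new issue in topic space, find its nearest cluster, and predict from that cluster's issues (here simplified to the cluster's median SP; the topic vectors and SP labels are made up, and the paper's per-cluster models may be richer):

```python
import statistics

def nearest_cluster(vec, centroids):
    """Index of the closest cluster centroid by Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c)) ** 0.5
    return min(range(len(centroids)), key=lambda i: dist(centroids[i]))

def estimate_sp(vec, centroids, cluster_sps):
    """Estimate SP of a new issue from the cluster nearest to its topic vector."""
    return statistics.median(cluster_sps[nearest_cluster(vec, centroids)])

# Made-up 3-topic LDA space: two clusters with their member issues' SPs.
centroids   = [(0.8, 0.1, 0.1), (0.1, 0.8, 0.1)]
cluster_sps = [[1, 2, 2, 3], [5, 8, 8, 13]]

print(estimate_sp((0.7, 0.2, 0.1), centroids, cluster_sps))  # → 2.0
```

The intuition is that issues sharing a topic profile (e.g., UI tweaks vs. backend migrations) tend to share an effort profile, so the nearest cluster's history is a reasonable prior for the new issue.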
How Do Android Developers Improve Non-Functional Properties of Software?
Nowadays there is increased pressure on mobile app developers to take non-functional properties into account. An app that is too slow or uses too much bandwidth will decrease user satisfaction, and can thus lead to users simply abandoning the app. Although automated software improvement techniques exist for traditional software, they are not as prevalent in the mobile domain. Moreover, it is as yet unknown whether the same software changes would be as effective. With that in mind, we mined 100 Android repositories to find out how developers improve the execution time, memory consumption, bandwidth usage and frame rate of mobile apps. We categorised non-functional property (NFP) improving commits related to performance to see how existing automated software improvement techniques can be improved. Our results show that although NFP-improving commits related to performance are rare, such improvements appear throughout the development lifecycle. We found altogether 560 NFP commits out of a total of 74,408 commits analysed. Memory consumption is sacrificed most often when improving execution time or bandwidth usage, although similar types of changes can improve multiple non-functional properties at once. Code deletion is the most frequently utilised strategy, except for frame rate, where an increase in concurrency is the dominant strategy. We find that automated software improvement techniques for the mobile domain can benefit from the addition of SQL query improvement, caching and asset manipulation. Moreover, we provide a classifier which can drastically reduce the manual effort needed to analyse NFP-improving commits.